WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau




PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification 7: G06F 19/00

A2

(11) International Publication Number: WO 00/41122

(43) International Publication Date: 13 July 2000 (13.07.00)

(21) International Application Number: PCT/US00/00167

(22) International Filing Date: 5 January 2000 (05.01.00)

(30) Priority Data:
60/114,806       5 January 1999 (05.01.99)   US
Not furnished    4 January 2000 (04.01.00)   US

(63) Related by Continuation (CON) or Continuation-in-Part (CIP) to Earlier Applications
US 60/114,806 (CIP)
Filed on 5 January 1999 (05.01.99)
US Not furnished (CIP)
Filed on 4 January 2000 (04.01.00)

(71) Applicant (for all designated States except US): CURAGEN CORPORATION [US/US];
11th floor, 555 Long Wharf Drive, New Haven, CT 06511 (US).

(72) Inventors; and

(75) Inventors/Applicants (for US only): BADER, Joel, S. [US/US]; 36 Ogden Road, Stamford,
CT 06903 (US). LIU, Yi [...]; [...] Prospect Street, #53, New Haven, CT 06511 (US).
[...]; 36 Whitting Farm Road, Branford, CT 06405 (US). DZIUDA, Darius [PL/US]; 1298
Hartford Turnpike, 9E, North Haven, CT 06473 (US). GUSEV, Vladimir [UA/US]; 1209 Durham
Road, Madison, CT 06443 (US). JUDSON, Richard, S. [US/US]; 42 Barker Hill Drive,
Guilford, CT 06437 (US). WENT, Gregory, T. [US/US]; 34 Scotland Avenue, Madison,
CT 06443 (US).

(74) Agent: ELRIFI, Ivor, R.; Mintz, Levin, Cohn, Ferris, Glovsky and Popeo, P.C., One
Financial Center, Boston, MA 02111 (US).

(81) Designated States: AE, AL, AM, AT, AU, AZ, BA, BB, BG, [...], CZ, DE, DK, DM, EE,
ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR,
LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI,
SK, SL, TJ, TM, TR, TT, TZ, UA, UG, US, UZ, VN, YU, ZA, ZW, ARIPO patent (GH, GM, KE,
[...]), Eurasian patent ([...], AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT, BE,
CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), OAPI patent (BF, BJ, CF,
CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).

Published

Without international search report and to be republished
upon receipt of that report.


(54) Title: NORMALIZATION, SCALING, AND DIFFERENCE FINDING AMONG DATA SETS




[Front-page figure: flow diagram with stages labeled DISCRETIZATION, INDIVIDUAL AVERAGE and GROUP AVERAGE.]

(57) Abstract 







WO 00/41122

PCT/US00/00167

NORMALIZATION, SCALING, AND DIFFERENCE FINDING 

AMONG DATA SETS 

Field of the Invention 

This invention relates to statistical analysis of differences between at least two data sets.

Related Applications

The present patent application claims priority to the United States provisional patent
application U.S.S.N. 60/114,806, entitled "[...] and Normalization", filed January 5, 1999,
which is incorporated herein by reference in its entirety.

Background of the Invention 

Measurements of the expression levels of individual genes within a cell provide a wealth
of information about cellular processes. This is done by extracting mRNA from a cell, possibly
converting these to more stable cDNA molecules, and measuring the concentration of each
individual species by methods such as differential display or hybridization. A typical analysis
strategy is to identify genes whose expression levels differ in a particular biological [...].
[Distinguishing] true differences from the false differences (those due simply to noise) has
presented a challenge for gene expression analysis. The relevance of this problem is that genes
that are differentially regulated can be converted to commercial products, including protein
[...], antibody targets, therapeutic markers, as well as conventional drug targets.

[...] challenges in evaluating the results appropriately.

[One difficulty in such] experiments is that the amount of material analyzed, such as mRNA or
cDNA, can differ from experiment to experiment or among the replicates of a single experiment.
[Analysis] strategies typically account for this overall variation by performing a global scaling
of all the measurements from such experiments. For example, if sample A has twice the overall
cDNA concentration of sample B, then the expression level for a gene in sample B must be
doubled before comparison with sample A.
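
By way of nonlimiting illustration, the global scaling described above may be sketched in Python as follows. The function name `global_scale`, the use of total intensity as the measure of overall material, and the sample values are assumptions made for this example only; they are not taken from the disclosure.

```python
def global_scale(sample, reference):
    """Scale `sample` so that its total intensity matches that of `reference`.

    Hypothetical helper: total intensity stands in for "overall cDNA
    concentration" in this sketch.
    """
    total_sample = sum(sample)
    total_reference = sum(reference)
    if total_sample == 0:
        raise ValueError("sample has no signal to scale")
    factor = total_reference / total_sample
    return [factor * value for value in sample], factor


# Sample A carries twice the overall material of sample B (made-up values).
sample_a = [20.0, 40.0, 10.0]
sample_b = [10.0, 20.0, 5.0]
scaled_b, factor = global_scale(sample_b, sample_a)
```

A single factor of this kind corrects only the overall level; it cannot correct a signal that tapers with fragment length.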

Often, however, such an overall scaling is not sufficient to discriminate between true
differences and those that can be attributed to noise. One source of difficulty is identifying the
particular features in the data set or sets that can be used as scaling landmarks. It is not always
possible to identify such unchanging features a priori.

An additional source of noise generally present in experimental studies is noise from
analytical instruments and methods. In differential display experiments, for example, the amount
of gene expression is related to the amount of PCR product generated in an amplification
reaction. The amount of product can depend on the activity of the polymerase enzymes as well
as the length of a fragment being replicated. If the enzyme functions effectively, the amount of
PCR product is uniformly high from small fragments to long fragments. If the enzyme activity is
less effective, however, the amount of PCR product can be relatively less for long fragments than
for short fragments. An overall scaling does not account for the non-uniform tapering of the
signal with the size of the amplicon.

There thus remains a strong need for counteracting and overcoming the effects of noise in
comparing data sets in an experimental study. There is a lack of adequate means for identifying
constant or unvarying components in a data set, which may serve as reference markers in
normalizing, scaling and distinguishing differences among data sets in such a study. There
further is a significant need for a means to identify scaling landmarks automatically in data sets
being compared to one another. In a particular framework addressed in this invention, there is a
need for robust methods that normalize, scale and find differences in experiments related to the
differential expression of genes in cells and tissues subjected to specific experimental treatments.
These and comparable needs are addressed by the present invention.

Summary of the Invention 

The present invention discloses a method of identifying a difference between at least two
groups, wherein each group comprises a data set containing ordered elements. The method
includes the steps of: (a) providing a first group having one or more elements in a first data set;
[...] the transformation is a calculation selected from a normalization calculation, [...]

In some embodiments, the method corrects the effects of [noise] prior to distinguishing
the differences. Prior to normalization, averaging, and/or scaling calculations, selected [regions
of] a data set that do not contain useful data may be masked, [and regions having higher]
information content may be highlighted. [...] too low (noise) or too high (saturation) [...].
[In one such embodiment,] the jiggle includes positional shifts of elements between different
[data sets; ...] considered as signal alignment between two or more data sets.

[...] replicate; and in additional such embodiments, traces for each replicate are discretized
prior to the normalizing, the averaging and/or the scaling.

In another embodiment, the normalization includes adjusting each data set such that a
subset of elements in each data set has similar or identical values.

In other embodiments, the averaging includes calculating an average for a location or
for a discretized position across a collection of data sets. The average may be an unweighted
average or a weighted average.

In additional embodiments, the scaling includes a calculation that causes a first data set to
resemble a second data set, except that an element in the scaled first data set whose intensity
differs significantly from the intensity of the element in the second data set at the same location
or the same position contributes to identifying the difference between the data sets. In further
embodiments, the scaling includes calculating a distance between the data sets, or calculating a
similarity between the data sets. In particular embodiments, the scaling calculation employs a
scaling function; and in other embodiments, the scaling function is a basis set expansion, such as
a piecewise linear basis set or a direct product of basis functions.
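
By way of nonlimiting illustration, a scaling function expressed as a piecewise linear basis set expansion may be sketched in Python as follows. The sketch assumes a one-dimensional trace, "hat" basis functions centered on user-chosen knots, and coefficients supplied by the caller; the names `hat`, `scaling_function` and `apply_scaling`, the knot positions and the coefficient values are hypothetical. How the coefficients would be fitted is not shown.

```python
def hat(k, n, knots):
    """Piecewise linear 'hat' basis function centred on knots[k]."""
    center = knots[k]
    if n == center:
        return 1.0
    if k > 0 and knots[k - 1] < n < center:
        return (n - knots[k - 1]) / (center - knots[k - 1])
    if k < len(knots) - 1 and center < n < knots[k + 1]:
        return (knots[k + 1] - n) / (knots[k + 1] - center)
    return 0.0


def scaling_function(n, knots, coeffs):
    """s(n) = sum over k of coeffs[k] * hat_k(n)."""
    return sum(c * hat(k, n, knots) for k, c in enumerate(coeffs))


def apply_scaling(trace, knots, coeffs):
    """Multiply each element a(n) by the position-dependent factor s(n)."""
    return [scaling_function(n, knots, coeffs) * a for n, a in enumerate(trace)]


# Assumed example: no correction up to position 4, then a factor rising to 2.
knots = [0, 4, 8]
coeffs = [1.0, 1.0, 2.0]
trace = [10.0] * 9
scaled = apply_scaling(trace, knots, coeffs)
```

In this sketch the knots must span the whole trace, since the expansion evaluates to zero outside the outermost knots. A position-dependent factor of this kind can follow a signal that tapers with fragment length, which a single global factor cannot.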

In yet additional embodiments, successive iterations of a cycle that includes at least one
of a normalization calculation, an averaging calculation and a scaling calculation are carried out
until a specified termination condition has been satisfied. In particular embodiments the
termination condition is that the transformed data set has converged. Alternatively the
termination condition is that a predetermined number of iterations has been reached.
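
By way of nonlimiting illustration, the two termination conditions may be sketched in Python as follows. The convergence test (largest element-wise change below a tolerance), the default values, and the toy transform `pull_toward_mean` are assumptions chosen only to make the loop runnable; they stand in for the normalization, averaging and scaling calculations and are not taken from the disclosure.

```python
def iterate_until_converged(data, transform, tol=1e-6, max_iter=100):
    """Apply `transform` repeatedly; stop on convergence or at the iteration cap.

    Returns (data, iterations_used, converged_flag).
    """
    for iteration in range(1, max_iter + 1):
        updated = transform(data)
        change = max(abs(u - d) for u, d in zip(updated, data))
        data = updated
        if change < tol:
            return data, iteration, True
    return data, max_iter, False


def pull_toward_mean(values):
    """Toy transform: move every element halfway toward the current mean."""
    mean = sum(values) / len(values)
    return [0.5 * (v + mean) for v in values]


result, n_iter, converged = iterate_until_converged([0.0, 10.0], pull_toward_mean)
capped, capped_iter, capped_ok = iterate_until_converged(
    [0.0, 10.0], pull_toward_mean, max_iter=3
)
```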

In a further embodiment, the distinguishing of differences among the elements of the
transformed data sets includes application of a difference finding algorithm.

The invention also discloses a display means that displays a representation of a difference
between data sets, and also discloses the representation itself, wherein the representation is
obtained in general by applying methods disclosed herein to the data sets.

Brief Description Of The Drawing

FIG. 1 is a graphic representation of jiggle arising between two traces.

FIG. 2 is a schematic diagram illustrating the flow from different groups to the
transformed data sets for those groups.

FIG. 3 is a schematic representation of [...] experimental noise in a [...].

FIG. 4 is a schematic representation of averages of replicate raw traces for each
individual animal in Example 1, prior to normalization or scaling.

FIG. 6 is a schematic representation of scaling factors employed to scale the
phenobarbital-treated individual average to the sterile-water-treated individual average in
Example 1, obtained as a result of iterative scaling.

[...]

Detailed Description Of The Invention

[...] uncontrolled variations exist between data sets. In particular embodiments, [the methods
are] adapted for differential display experiments [in which gene expression is represent]ed by
fragment intensities in, for example, an electrophoresis trace. The method [...]
noise includes low frequency noise. In another embodiment, the noise includes differences in
jiggle. In the latter embodiment, the jiggle includes positional phase shifts of elements between
different data sets. The invention also discloses a display means that displays a representation of
a difference between data sets, and also discloses the representation itself, wherein the
representation is obtained in general by applying methods disclosed herein to the data sets.

As used herein, "representation" relates to any graphical, visual, or equivalent non-verbal
display that provides an image of the results, such as differences between data sets, obtained
according to the methods of the present invention. More specifically, a "representation" of the
invention is obtained by transforming the quantitative results gathered by experiments underlying
the invention. Examples of such data include, by way of non-limiting example, traces from
differential gene expression, intensities from an array, and/or equivalent types of
experimental parameter.

In some embodiments, a representation of the invention is generated by algorithms
executed in a computer and is suitable for display on a display means, such as a display screen or
monitor, employed in the operation of the computer. The representation is also suitable for
storing in a storage module or data archive of such a computer. It is still further suitable for
printing from the computer onto a medium such as paper or an equivalent physical medium, and for
recording onto a portable storage medium, including, for example, magnetic media, CD ROMs
and equivalent storage media. As used herein, "display means" includes any of the objects and
media identified above in this paragraph, as well as equivalent apparatuses and objects suitable
for displaying the results of computational processes for visual inspection.

In addition, "normalization" is defined herein as a means for standardizing or correcting
elements in a data set, for example, but not by way of limitation, for correcting overall signal
strength within a given data set. Features of given elements to be normalized are first identified
within a data set. For example, one such feature may be the median peak height of signals within
a data set. A summary statistic for the given feature is generated for a data set, and used to
normalize the elements, as described below, to allow comparisons across data sets. Algorithms
that are designed to either mask or highlight chosen features identified among the elements of a
data set may be applied prior to normalization, averaging or scaling. Such features include low
intensity signal regions that comprise noise, high intensity signal regions that comprise saturation
zones, and local maxima that comprise peaks. "Averaging" is defined as combining multiple
data sets to generate one [represent]ative data set. Averages are combined into the
representative data set in such a way that noise from any one data set so combined does not affect
any other data sets used to generate the average. "Scaling" is defined as a correction for low
frequency differences in signal strength across data sets.



The data [sets] may arise in any of a number of ways. Any experiment or study in which
[...] may be employed in the invention. S[uch]
groups may be distinguished by the experimental conditions experienced by the respective
groups, or by the experimental state characterizing the groups. [...]
may be animate or inanimate, or may be inanimate samples derived from animate subjects.

[In certain embodiments of the methods,] display means and representations, the data sets arise
from experiments conducted in investigations in which identification of the differential
expression of a gene or genes between data sets from at least one experimental group and at
least one control group is sought. In certain embodiments of the invention such differential
[expression is evaluated using] GeneCalling™ experiments. See, e.g., United States Pat[ent ...]
17, 798-803 (1999) [...] (1999). In other embodiments of the invention such differential
expression is evaluated using nucleic acid microchip arrays in order to determine the
[pres]ence or extent of expression of a gene or gene fragment. Any alternative or equivalent
differential expression formats and methods of analysis [...].

Various types of [noise ...] includes relatively high frequency [noise],
such as that commonly associated with short time fluctuations in the electronic and/or
mechanical components of an experimental system. High frequency noise may be defined as
having a frequency greater than about 1 Hz. Examples of high frequency noise include shot
noise in photodetectors and comparable electronic noise arising in the various electronic
components and circuits of an experimental in[strument ...] elements of a data set.

[...] Low frequency noise has a frequency less than about [...], or less than about [...],
or even less than about 0.001 Hz or lower. Such low frequency noise [may arise] during an
experiment, for [example, ...].
Alternatively, an uncompensated low frequency change in response of an electronic instrument
may arise during the time in which a data set is being gathered. Additionally, if an array is being
used to generate the data set, uncompensated variations in detection across the various positions
and/or dimensions of the array may arise that behave as low frequency noise (i.e., they may be
considered as low frequency noise even though an array may be subjected to simultaneous
detection of all the sample points on the array, since positional variations behave as if they have a
long wavelength across the array). Equivalent sources of low frequency noise are also
encompassed in this definition. Normalization and scaling algorithms employed are particularly
effective in minimizing or eliminating the effects of low frequency noise.

An additional detrimental effect that may arise in identifying differences between data
sets is termed "jiggle". By this term is meant that the elements of one data set are offset in a
longitudinal direction in comparison with the elements of a second data set with which the first
data set is being compared. Longitudinal displacement relates to variation in the location or
discretized position of a particular feature in a trace even though the feature appears in the traces
of more than one group. By way of nonlimiting example, uncompensated variation in the
location or discretized position of the feature may occur due to variations in physical or chemical
conditions during the process of accumulating the data elements of the various data sets being
considered. Such a variation, or jiggle, may be considered to be low frequency noise in the
longitudinal, or positional, direction. Jiggle is illustrated in FIG. 1. In this figure two
discretized, normalized data sets, A(n) and B(n), are shown. A(n) and B(n) should be thought of
as each representing the same feature. Nevertheless, they are displayed with a jiggle of [...]75
units on the n axis. It is an additional aspect of the present invention that the normalization and
[scaling procedures ...] the effects of low frequency longitudinal noise. Such procedures, as
employed in the methods of the present invention, largely or completely eliminate the jiggle and,
referring to FIG. 1, restore overlap of the points for A(n) and B(n). Compensation for jiggle as
shown in FIG. 1 is also termed "signal alignment".
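
By way of nonlimiting illustration, one simple way to estimate and remove a positional offset between two discretized traces is sketched below in Python. It is a brute-force search over whole-position shifts that minimizes the mean squared difference; this is a substitute technique chosen for the example and is not the normalization and scaling procedure of the disclosure, which may also handle fractional offsets. The names `best_shift` and `align` and the trace values are hypothetical.

```python
def best_shift(a, b, max_shift):
    """Return the integer shift s minimising the mean squared difference
    between a(n) and b(n + s) over their overlapping positions."""
    best_s, best_cost = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(a[n], b[n + s]) for n in range(len(a)) if 0 <= n + s < len(b)]
        if not pairs:
            continue
        cost = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s


def align(b, shift, fill=0.0):
    """Shift b so that aligned[n] = b[n + shift]; positions without data get `fill`."""
    return [b[n + shift] if 0 <= n + shift < len(b) else fill for n in range(len(b))]


# The same feature appears two positions later in trace_b than in trace_a.
trace_a = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
trace_b = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]
shift = best_shift(trace_a, trace_b, max_shift=3)
aligned_b = align(trace_b, shift)
```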



Groups, Individuals, Replicates and Transformed Data

A hierarchy of notation is used herein to indicate data elements and/or the data sets.
These notations are discussed below and furthermore are illustrated in the flow diagram
presented in FIG. 2. Raw, i.e. untreated or untransformed, data arise from the carrying out of
the experiments on actual samples obtained from experimental groups. A "group" represents a
particular experimental state or condition. The groups are denoted herein in capital letters A, B,
... without any indices or delimiters. As shown in FIG. 2, at least two groups comprise the
subject matter on which the methods, display means and representations of the present
[invention operate. Raw data arise when samples from the] groups
have been subjected to a given experimental method of detection or analysis. Experimental
[data] not transformed by any calculations of the methods disclosed herein are designated using
lower case letters together with at least one index or delimiter, [as] shown, for example, by
ai, bi, ... (see FIG. 2). A group may be initially composed of one or more individuals. The
number of individuals is not fixed or constant, but may vary. [For animate groups, each]
individual may represent an individual animal, a plant (such as a seedling), or a set of cells
grown in cell or tissue culture. Correspondingly, for inanimate groups, each individual may
represent, again by way of nonlimiting example, a separate execution of a particular experimental
protocol such as a synthetic or preparative procedure, or the implementation of a particular
[set of] physical conditions on separate samples or objects. Equivalent ways of designating
individuals of a group are encompassed within the scope of the present invention. In general,
the data sets obtained from the individuals of a group may be transformed by any one or more
of the normalization, averaging and scaling calculations of this invention in arriving at the
differences determined by the present methods.

As a further hierarchical subclassification, each individual of a group may furnish one or
more replicate samples for detection or analysis according to the experimental method employed.
Such replicates also represent raw, or untreated, data. Each replicate of an individual is
designated with a second index or delimiter, shown, for example, by aij, bij, ... (see FIG. 2).
As shown for illustration in FIG. 2, the number of replicates may vary due to experimental
circumstance. Commonly replicates are obtained by repetitive sampling from the same
individual. In general, the data obtained from the replicates of a particular individual may be
operated upon by any one or more of the normalization, averaging and scaling calculations of the
present invention in arriving at the difference determined by the present methods. The
normalization, averaging and/or scaling calculations that are applied to the replicates may be
applied prior to, or simultaneously with, the similar calculations applied to the individuals and
discussed in the preceding paragraph.

In many of the detection or analytical methods employed in the experiments underlying
the gathering of the presently disclosed data sets, continuous traces of an experimental intensity
as a function of a longitudinal variable such as time, elution volume or distance are obtained.
Such traces arise, for example, in the use of chromatographic or electrophoretic methods of
detection or analysis. Since the traces are continuous, each data set may be considered to be
comprised of an infinite number of data elements designated using a further delimiter, aij(x),
where x denotes the continuous longitudinal dimension of the analytical method. (It may be
noted that use of alternative analytical or detection methods, for example use of arrays having
discrete positions on them, does not generate a continuous trace. Such data sets, therefore, in
general need not carry the additional delimiter.) It is convenient for virtually all calculations
carried out as disclosed herein, using computers with discrete memory locations for separate data
elements, to discretize a continuous trace into discrete intensities at specified locations or
discrete positions n on the trace. As used herein, the delimiter n replaces the delimiter x when
an intensity trace has been discretized; i.e., aij(x) becomes aij(n) (see FIG. 2).
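
By way of nonlimiting illustration, the discretization of a trace onto a regular grid may be sketched in Python as follows. The sketch assumes that the raw trace is available as measured intensities at increasing positions and that linear interpolation between measurements is acceptable; the function name `discretize`, its parameters and the sample values are hypothetical.

```python
def discretize(xs, intensities, dx, length):
    """Resample a trace measured at increasing positions `xs` onto the regular
    grid x = 0, dx, 2*dx, ..., up to `length`, by linear interpolation.

    Grid points outside the measured range take the nearest end value.
    """
    n_points = int(round(length / dx)) + 1
    grid = []
    j = 0
    for n in range(n_points):
        x = n * dx
        if x <= xs[0]:
            grid.append(intensities[0])
            continue
        if x >= xs[-1]:
            grid.append(intensities[-1])
            continue
        # Advance to the measurement interval [xs[j], xs[j + 1]] containing x.
        while xs[j + 1] < x:
            j += 1
        t = (x - xs[j]) / (xs[j + 1] - xs[j])
        grid.append(intensities[j] + t * (intensities[j + 1] - intensities[j]))
    return grid


# a(x) sampled at x = 0, 1, 2 becomes a(n) on a grid with spacing 0.5.
xs = [0.0, 1.0, 2.0]
raw = [0.0, 10.0, 0.0]
a_n = discretize(xs, raw, dx=0.5, length=2.0)
```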

As used herein, any data sets that have been transformed using the calculations disclosed
herein are designated in upper case letters including an index and/or a delimiter. As noted
[above, such transformations] may include at least one operation chosen from among
normalization, [averaging] and scaling. A transformation that operates to combine the replicates
of an individual while leaving the individuals of a group intact results in a transformed data set
indicated by one index [and a delimiter, for example Ai(n), Bi(n), .... A transformation that
combines the] individuals of a group to provide a single data set for an entire group is designated
by a delimiter [only, as] shown, for example, by A(n), B(n), ... (see FIG. 2). Conversely, [in]
experimental methods of analysis or detection that do not rely on developing traces, the Ai(n),
Bi(n), ... are obtained directly without discretization. They may still arise from replicate
samples, however. For example, if the detection method is based on an array, one or more
positions in the array may represent the results of one or more replicates, respectively.

A particular embodiment of a data set envisioned in the present invention is differential
display. In a differential display experiment involving gene expression, mRNA [is extracted from
a] sample, converted to cDNA, and digested with restriction [enzymes. See, e.g., ...]
Patent No. 5,871,[...]. [The resulting fragments are]
then separated according to length using electrophoresis. Although nucleic acids consist of an
integer number of nucleotides, their electrophoretic transport properties also depend on
nucleotide composition. Electrophoresis experiments currently in use measure the length of a
nucleic acid fragment determined electrophoretically [...].
The electrophoretic length is usually within 1 nt of the actual number of nucleotides [in the]
fragment. The intensity signal a(x) of the electrophoresis trace for sample A represents the
amount of fragments of electrophoretic length x generated from the sample. Of course the
intensity a(x) also depends on the particular restriction enzymes used to generate fragments; for
simplicity this dependence is suppressed in the notation. Sometimes the intensity at x
[corres]ponds to a single fragment; sometimes multiple [fragments ...]
are combined; sometimes no fragments are present and a(x) is a baseline signal. Because
a(x) is a measured intensity, it should be a positive quantity. Mathematical operations used
during signal processing, such as the subtraction of a baseline, might result in negative values at
certain locations a(x). If negative values exist, their magnitude should preferably be of the same
order as the measurement error in the data set.

To detect differences, the intensity a(x) from samples in group A is compared with the
intensity b(x) generated using an identical protocol from samples in group B. Some [of the
observed] differences between A and B can be attributed to underlying genetic variation [...].
For example, samples A and B may include organisms or individuals having an allelic
variation between them that generates a difference in the measured expression level that [has no
bio]logical relevance in the context of the particular experimental [study. For]
example, a neutral single nucleotide polymorphism (SNP) can add or remove a band [...]. [For
this] reason it is preferable to include multiple organisms or individuals for samples A and [B,
to] control for these types of individual differences. In general, as noted earlier, the expression
profile of the ith individual of group A is denoted ai(x), and similarly bi(x) is the expression
profile for individual i of group B.

Furthermore, it is preferable to have multiple experimental replicates of the expression
profiles for each organism or individual. The jth expression profile, or replicate, of the ith
individual of group A is denoted as aij(x), and similarly for group B. Thus, each group may have
one or more individuals, and each individual may have one or more replicates. More elaborate
hierarchies are also possible and may be analyzed directly with the methods outlined below.
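
By way of nonlimiting illustration, the hierarchy of groups, individuals and replicates may be held in a nested structure and collapsed level by level, as in the Python sketch below. Only an unweighted average is shown, and no normalization or scaling is applied; the layout of `data`, the function name `average` and all values are assumptions made for the example.

```python
def average(traces):
    """Unweighted position-by-position average of equally long traces."""
    count = len(traces)
    return [sum(values) / count for values in zip(*traces)]


# groups -> individuals -> replicates; each replicate is a discretized trace aij(n).
data = {
    "A": [
        [[1.0, 2.0], [3.0, 4.0]],  # individual 1 of group A: replicates a11, a12
        [[5.0, 6.0]],              # individual 2 of group A: single replicate a21
    ],
    "B": [
        [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],  # individual 1 of group B
    ],
}

# Combine replicates within each individual: aij(n) -> Ai(n).
individual_averages = {
    group: [average(replicates) for replicates in individuals]
    for group, individuals in data.items()
}

# Combine individuals within each group: Ai(n) -> A(n).
group_averages = {
    group: average(per_individual)
    for group, per_individual in individual_averages.items()
}
```

The number of individuals per group and of replicates per individual is deliberately unequal here, since the text states that neither is fixed.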

An alternative embodiment of an experimental system relates to hybridization. In this
embodiment, let aij(x) represent the intensity from the jth experimental replicate of the ith
organism in group A measured at position x on a hybridization array or chip. Here x is a
two-dimensional coordinate that identifies the location of a particular spot on the hybridization
surface. The term trace is used herein to denote a one-dimensional data set, and the term array or
hybridization data is used herein to represent a two-dimensional data set. Terms such as data set,
signal, and intensity may represent one- or two-dimensional data sets. Furthermore, repeated
experiments, such as hybridization experiments conducted on a series of biological samples
collected over time, may have additional dimensions. By way of nonlimiting example, each time
point in a time course study generates a two-dimensional plane of data, and the time coordinate
adds a third dimension. The methods disclosed herein are applicable to data sets such as these,
as well as to those of the preceding paragraphs. In full generality the methods disclosed herein
are applicable to any experimental study that generates multi-dimensional data sets. In
particular cases, attention may be restricted to a particular dimensionality of data sets, as the
specific character of the study may provide. Furthermore, as used herein the terms "signal" and
"intensity" may be considered interchangeable references to either measured, normalized, scaled
or averaged data.

Data sets can have many representations in the memory of a computer. Here it is
assumed that each data set can be represented as a set of discretized intensity values, or elements
in the data set. Although it is not necessary to use a regular grid to store the intensity, it is
convenient to do so. Using a regular grid in one dimension, an intensity a(x) is stored at
locations x = 0, Δx, 2Δx, ..., lΔx (where lΔx = L, and L represents the full length of a trace in
the longitudinal direction). In two dimensions, a(x) is stored at locations (x1, x2) where x1 = 0,
Δx, 2Δx, ..., lΔx, and x2 = 0, Δy, 2Δy, ..., mΔy (mΔy = M). In d dimensions, a(x) is stored at
[...], [where each coordinate] takes on the values 0, [...].

For electrophoresis data, such as that generated by differential display, Δx is
preferably close to the reproducibility of the [...]; [a]
value of [...] is preferable. For hybridization [data, ...]
pixel in an image is preferable. [...]

Methods of Calculation, Algorithms 

The invention allows for the identification of a difference between [...]
data set contains data elements as described above. The difference [finding] operates
on at least two transformed data sets A(n) and B(n) [...]
differences that exceed a [...].

Masks for Noise, Saturation, and Peaks 

It is useful to mask out regions in the data that [do not contain useful information, and to]
highlight regions that have a higher information [content. Regions in which the]
intensity is too low or too high for [...]
[a mask] records this [...].

The noise mask m_noise(n) depends on a noise level I_noise, the [...]
experimental uncertainty in the measured intensity. [...]
example, as the standard deviation of the [...] blank or control
sample. I_noise may also be preferably [...]
low end of the dynamic range of the detection instrument. The noise mask is calculated as
follows:

• For each position n
  o m_noise(n) = 1 if a(n) < I_noise
  o m_noise(n) = 0 otherwise.
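
By way of nonlimiting illustration, the noise mask rule above may be sketched in Python as follows for a one-dimensional trace. The function name `noise_mask` and the sample values, including the noise level of 10, are assumptions made for the example.

```python
def noise_mask(trace, i_noise):
    """m_noise(n) = 1 where a(n) < I_noise, and 0 otherwise."""
    return [1 if a < i_noise else 0 for a in trace]


trace = [2.0, 15.0, 9.9, 10.0, 120.0]
m_noise = noise_mask(trace, i_noise=10.0)
```

Because the comparison is strict, an element exactly equal to the noise level is not masked.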

T^c «„ .ask .,„(n) marks cau coUeced near U,= upper of U,e de,ec,io„ 
™g= of an .sm^em. The n,ask depends on a sa«ra,ion level, as follows: 
• For each position n 

o «isat(n)= 1 ifa(n)>Is3^. 
o nisat(n) = 0 otherwise. 

The toshold W is preferably close ,„ rte high end of >he dynan,ic ,an« of the 
detection instrument (0.95X or to IX). 

The saturation mask may also be calculated in a second
pass that depends on a saturation width w_sat, as follows:

• For each position n

o m_sat(n) = 1 if a(n) is part of a plateau of constant value over a range of w_sat in
each dimension. Thus, in one dimension, if a(n) = a(n+1) = a(n+2) = ... =
a(n+w_sat) and a(n) = a(n-1) = a(n-2) = ... = a(n-w_sat), then m_sat(n) = 1 for each
of these points. Less preferably, m_sat(n) = 1 only for the center point; the
remaining points require their own saturation checks.
o m_sat(n) = 0 otherwise.

For differential display data Wco»Ax = n r,t ,0 r l , 

'^^ "•^"^'"P^^^^^'^ble, and Isat corresponds to the 

camera intensity at saturation. 
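The plateau-based second pass can be sketched for one dimension as follows. This is an illustrative reading, not the patented implementation: it assumes exact equality defines a plateau, ignores I_sat, and marks every point of any constant run that is long enough to contain a center with w_sat equal neighbours on each side. The function name is hypothetical.

```python
def plateau_saturation_mask(a, w_sat):
    """Mark all points of constant-valued runs of length >= 2*w_sat + 1."""
    n = len(a)
    mask = [0] * n
    start = 0
    while start < n:
        # Extend the run of equal values beginning at `start`.
        end = start
        while end + 1 < n and a[end + 1] == a[start]:
            end += 1
        # A run this long contains a center with w_sat equal points per side.
        if end - start + 1 >= 2 * w_sat + 1:
            for k in range(start, end + 1):
                mask[k] = 1
        start = end + 1
    return mask


print(plateau_saturation_mask([1, 5, 5, 5, 2], w_sat=1))  # [0, 1, 1, 1, 0]
print(plateau_saturation_mask([1, 5, 5, 5, 2], w_sat=2))  # [0, 0, 0, 0, 0]
```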

The peak mask m_peak(n) depends on a peak width w_peak, as follows:

• For each position n

o In d dimensions, m_peak(n) = 1 if a(n+Δn') < a(n+Δn) for all Δn and Δn'
within distance w_peak of n such that Δn' is further than Δn from n, and
Δn and Δn' are identical except for a single dimension in which they differ by 1.
For this purpose, distances may be calculated by any of a number of methods
including the Euclidean metric, the Manhattan metric, or the maximum absolute
difference in any dimension. In one dimension, m_peak(n) = 1 if
a(n-w_peak) < ... < a(n-2) < a(n-1) < a(n) > a(n+1) > a(n+2) > ... > a(n+w_peak).
Preferably, a(n+w_peak) > I_noise and a(n-w_peak) > I_noise as well.
o m_peak(n) = 0 otherwise.
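The one-dimensional peak condition can be sketched as below. This is a minimal sketch under stated assumptions: points closer than w_peak to either end of the trace are never marked as peaks, and the optional noise check is applied only to the two outermost points, as in the preferred condition. The function name is hypothetical.

```python
def peak_mask(a, w_peak, i_noise=None):
    """m_peak(n) = 1 where a rises strictly over w_peak points and then falls strictly."""
    n = len(a)
    mask = [0] * n
    for c in range(w_peak, n - w_peak):
        # a(c-w) < ... < a(c-1) < a(c)
        rising = all(a[k - 1] < a[k] for k in range(c - w_peak + 1, c + 1))
        # a(c) > a(c+1) > ... > a(c+w)
        falling = all(a[k] > a[k + 1] for k in range(c, c + w_peak))
        above_noise = i_noise is None or (
            a[c - w_peak] > i_noise and a[c + w_peak] > i_noise
        )
        if rising and falling and above_noise:
            mask[c] = 1
    return mask


print(peak_mask([1, 2, 3, 5, 3, 2, 1], w_peak=2))             # [0, 0, 0, 1, 0, 0, 0]
print(peak_mask([1, 2, 3, 5, 3, 2, 1], w_peak=2, i_noise=2))  # [0, 0, 0, 0, 0, 0, 0]
```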

For one-dimensional differential display data, w_peak Δx = 0.3 nt is preferable. For
hybridization, if an image has already been processed
Normalization 

> ^ normalized by first detf.TTT,m;«„ ♦u 

whe. .he peak r^, , . ^' 

individual values ,f de 1 ' T ' ' '"^"""'^ 

.3-.pe.e.,.eva> 

v...Ka...e.e..„.e.r:::e:^^^^^ 

poin.ac::::t:;rT!:":rr""^'"^'^"=^^^-^^^^^^^^ 
~sca,e..j:;T::et::r"^^^^^^ 

convenien.. ' "*"'"''™'=^"^.^''^""uchas lOOis 

The noise threshold I_noise may also be subject to the same normalization. Alternatively,
I_noise may be set to a fixed value. A preferable value for normalized differential display data is
I_noise = 10.

Averaging 

The average A(n) of r data sets a_1(n), a_2(n), ..., a_r(n) is calculated as
A(n) = Σ(i=1..r) w_i a_i(n) / [ Σ(i=1..r) w_i ]

where w_i is a weighting applied to data set i. If Σ(i=1..r) w_i = 0 at a position n, points where the
weight does not vanish may be used to estimate A(n). Most preferably, the value A(n) can be set equal to the
value at the closest point where the weight does not vanish. Alternatively, the unweighted values
a_i(n) may be used.

Any of a variety of weighting functions may be used, and may account for characteristics
of a data set such as those discussed in the following. A preferred weighting function is
w_i(n) = 1 - m_sat,i(n), where m_sat,i(n) is the saturation mask for data set i. Weights may also incorporate
error estimates from the data sets. For example, suppose that the data set a_i(n) is known with
statistical error e_i(n). Then a maximum likelihood estimate for A(n) is obtained by minimizing a
chi-square statistic

Σ(i=1..r) [A(n) - a_i(n)]^2 / e_i(n)^2
with respect to the final average A(n) to obtain w_i = 1/e_i(n)^2 or, if desired, w_i = [1 -
m_sat,i(n)]/e_i(n)^2. If a_i(n) is itself derived from an average of other data sets, then the standard
error of the mean is an appropriate choice for e_i(n). If a_i(n) is an unnormalized data set, then an
appropriate choice for e_i(n) is the background noise level I_noise defined previously. If a_i(n) is a
normalized data set, then it is appropriate to use the correspondingly normalized noise level.

It is preferable to calculate a standard deviation SD_A(n) to describe the distribution of
data points leading to the average A(n). A preferable formula for SD_A(n) is
SD(n) = [ Σ(i=1..r) [A(n) - a_i(n)]^2 / (r-1) ]^(1/2).

The standard error for the average is preferably calculated as
E(n) = SD(n)/r^(1/2).
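The averaging formulas above can be sketched as follows. This is an illustrative sketch with assumed conventions: weights are supplied per data set and per position (for example 1 - m_sat,i(n)), a position where all weights vanish is returned as None rather than filled from a neighbouring point, and the standard deviation is the unweighted form given in the text. All function names and sample values are hypothetical.

```python
import math


def weighted_average(data_sets, weights):
    """A(n) = sum_i w_i(n) a_i(n) / sum_i w_i(n); None where every weight is 0."""
    average = []
    for n in range(len(data_sets[0])):
        total = sum(w[n] for w in weights)
        if total == 0:
            average.append(None)
        else:
            average.append(sum(w[n] * a[n] for a, w in zip(data_sets, weights)) / total)
    return average


def standard_deviation(data_sets, average):
    """SD(n) = sqrt( sum_i [A(n) - a_i(n)]^2 / (r - 1) )."""
    r = len(data_sets)
    return [
        None if avg is None
        else math.sqrt(sum((avg - a[n]) ** 2 for a in data_sets) / (r - 1))
        for n, avg in enumerate(average)
    ]


def standard_error(sd, r):
    """E(n) = SD(n) / sqrt(r)."""
    return [None if s is None else s / math.sqrt(r) for s in sd]


replicates = [[10.0, 100.0], [12.0, 50.0]]
sat_masks = [[0, 1], [0, 0]]                       # second point of replicate 1 is saturated
weights = [[1 - m for m in mask] for mask in sat_masks]
avg = weighted_average(replicates, weights)
print(avg)  # [11.0, 50.0]
```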

Similarity and Difference 

In preparation for describing the scaling operation below, it is necessary to determine the
extent of agreement between two data sets. This extent of agreement can be measured, by way of
nonlimiting example, by the distance Dist[A,B] or the similarity Sim[A,B] between two data sets
A(n) and B(n).

Two possible formulas for the distance Dist[A,B] are

Dist[A,B] = Σ_n w[A(n),B(n)] dist[A(n),B(n)] and

Dist[A,B] = Σ_n w[A(n),B(n)] dist[A(n),B(n)] / Σ_n w[A(n),B(n)].

The term dist[A(n),B(n)] is a function that measures the distance between two values
A(n) and B(n). The term w[A(n),B(n)] is a mask that determines whether the data points at
location n should be included in the calculation. The second formula is preferable.
A preferable formula for the distance dist(a,b) for two numbers a and b is
dist(a,b) = [ln(a/b)]^2, where ln() is the natural logarithm. Here a and b must be
regularized to prevent values close to 0 from causing a divergence. This can be accomplished,
for example, by replacing a or b by a minimum value I_min if either is smaller than I_min, or by
adding a positive constant to raise all values A(n) and B(n) above 0.

Other acceptable formulas are as follows:

the absolute difference, dist(a,b) = |a - b|;

the Euclidean distance, dist(a,b) = [(a - b)^2]^(1/2);

the square distance, dist(a,b) = (a - b)^2; or

any non-negative function F(a,b) that is 0 when a = b and increases with increasing |a - b|
or increasing |ln(a/b)|.

A preferable formula for the weight w[A(n),B(n)] is
w[A(n),B(n)] = w_A(n) w_B(n) if dist'(a,b) < D_max and
w[A(n),B(n)] = 0 otherwise,

where w_A(n) and w_B(n) are weights for the individual data sets, dist'(a,b) is a distance
measure and D_max is some maximum distance. Possible choices for the distance measure
dist'(a,b) are the same as the choices for dist, but the same distance measure need not be used for
both. A preferable choice is dist'(a,b) = |ln(a/b)| and D_max = 3.

The weight w_A(n) is preferably [1 - m_A,noise(n)][1 - m_A,sat(n)], and similarly for
w_B(n). Other acceptable alternatives are to use either the noise mask or the saturation mask or
to use no mask and set w_A(n) = w_B(n) = 1.

It is also acceptable to use w[A(n),B(n)] = 1.
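The preferred combination described above (the normalized formula for Dist, the squared log-ratio for dist, and the |ln(a/b)| < D_max cutoff in the weight) can be sketched as follows. This is an illustration only: regularization by clamping to a minimum value of 1.0 is an assumed default, and the function name and sample data are hypothetical.

```python
import math


def dist_data_sets(A, B, w_a, w_b, d_max=3.0, i_min=1.0):
    """Dist[A,B] = sum_n w dist / sum_n w, with dist(a,b) = [ln(a/b)]^2."""
    numerator = 0.0
    denominator = 0.0
    for a, b, wa, wb in zip(A, B, w_a, w_b):
        # Regularize values close to 0 by clamping to a minimum intensity.
        log_ratio = math.log(max(a, i_min) / max(b, i_min))
        # Pair weight: product of per-set weights, zeroed for outlying ratios.
        w = wa * wb if abs(log_ratio) < d_max else 0
        numerator += w * log_ratio ** 2
        denominator += w
    if denominator == 0:
        raise ValueError("every position is masked out")
    return numerator / denominator


A = [100.0, 200.0, 1000.0]
B = [100.0, 100.0, 1.0]
ones = [1, 1, 1]
# The third pair has |ln(1000)| > 3, so it is excluded from both sums.
print(dist_data_sets(A, B, ones, ones))
```

Because the cutoff removes the pair from the denominator as well, one wildly discordant point does not dilute the distance computed from the remaining points.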

A similarity Sim[A,B] between two data sets A and B may be defined as
Sim[A,B] = Σ_n sim[A(n),B(n)]
where the similarity sim(a,b) between two numbers a and b is large when the quantities
are nearly equal, and is also large when a and b are both large.

One formula plots the point (a,b) in a plane and measures its distance p along the line
a = b and its distance d from the same line, as shown in the figure below. Then define
sim(a,b) = p exp(-d^2/2σ^2)/[2πσ^2]^(1/2), where

p = |a+b|/√2,
d = |a-b|/√2, and

σ characterizes the experimental noise in the data sets.
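The pointwise similarity can be sketched as below. The printed formula is partly illegible, so this sketch assumes the normalization factor is [2πσ²]^(1/2), that is, a Gaussian in the distance d from the line a = b multiplied by the distance p along it. The function name is hypothetical.

```python
import math


def sim(a, b, sigma):
    """sim(a,b) = p * exp(-d^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    p = abs(a + b) / math.sqrt(2.0)   # distance along the line a = b
    d = abs(a - b) / math.sqrt(2.0)   # distance from the line a = b
    return p * math.exp(-d * d / (2.0 * sigma * sigma)) / math.sqrt(2.0 * math.pi * sigma * sigma)


# Equal values score higher than unequal ones, and larger equal values score higher still.
print(sim(3.0, 3.0, 1.0) > sim(3.0, 1.0, 1.0))  # True
print(sim(6.0, 6.0, 1.0) > sim(3.0, 3.0, 1.0))  # True
```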

10 e ofthe best hnear regression lineb = .a,^^^ 
residual of the pomts (A(n),B(n)) from the line b = ma, 

o = [ 2n 2 [ (mA(n) - B(n)) / (m+1) ]2 / (r-1) ]l/2 
where r-1 is the number of degrees of freedom in the fit. 
Other preferable formulas for sim(a,b) are
sim(a,b) = F(p)G(d)

where F(p) is an increasing function of p and G(d) is a decreasing function of d.

This algorithm is related to one described in the literature as a method of decomposing spectra of
multicomponent mixtures into separate spectra for each of the pure components.

Equivalent procedures for evaluating the similarity and difference between data sets are
encompassed within the scope of the present invention.

Scaling 

Scaling is an operation that is applied to a subordinate data set a(n) to bring it in closer
agreement with a master data set A(n). A scaling algorithm optimizes a scaling function s(n) to
minimize the distance or maximize the similarity between the scaled slave data set s(n)a(n) and
the master data set A(n).

The scaling function s(n) may have various mathematical representations. One
representation is a basis set expansion
s(n) = Σ_p c_p φ_p(n),

where c_p is the coefficient of basis function φ_p, evaluated at
position n, and p ranges over P basis functions numbered p = 1 to P. A basis
set that is piecewise linear interpolates between the coefficients at interpolation points n_p, so
that for n between n_{p-1} and n_p,

s(n) = c_{p-1} + (c_p - c_{p-1})(n - n_{p-1})/(n_p - n_{p-1}).

In d dimensions,
s(n) = Σ_p c_p φ_p(n),

where here n and p are both d-dimensional and φ_p(n) can be expressed as
φ_p(n) = φ_p1(n1) φ_p2(n2) ... φ_pd(nd)

where nj and pj are the components of n and p in dimension j, and φ_pj is a one-
dimensional basis set for dimension j.

A preferred choice for the one-dimensional basis sets in a multidimensional direct
product is an orthogonal basis. A preferred choice for an orthogonal basis is a trigonometric
basis,

φ_pj(nj) = cos[(pj-1)π(nj - nj0)/(nj1 - nj0)],

where nj0 and nj1 are the left-most and right-most points in dimension j. In one
dimension, for example, with points n = 0 to l corresponding to distances 0 to L, the pth basis
function is

φ_p(x) = cos[(p-1)πx/L].
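A one-dimensional scaling function built from this cosine basis can be sketched as follows. The coefficient values in the example are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine_basis(p, x, length):
    """phi_p(x) = cos[(p - 1) * pi * x / L], with p numbered from 1."""
    return math.cos((p - 1) * math.pi * x / length)


def scaling_function(coefficients, x, length):
    """s(x) = sum_p c_p * phi_p(x) for coefficients c_1 .. c_P."""
    return sum(
        c * cosine_basis(p, x, length)
        for p, c in enumerate(coefficients, start=1)
    )


# With P = 2 the scale factor drifts smoothly from c_1 + c_2 at x = 0
# to c_1 - c_2 at x = L.
print(scaling_function([1.0, 0.5], 0.0, 400.0))    # 1.5
print(scaling_function([1.0, 0.5], 400.0, 400.0))  # 0.5
```

The first basis function is constant, so a single coefficient reduces to multiplying the whole trace by one number.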

An advantage of this basis set is that its leading functions describe low-
frequency variations. Typically, a small number of basis functions
provides a good approximation of the scaling function.

A preferable choice is to choose a number of basis functions P such that the low-frequency noise in the
data occurs on a length scale of L/P or longer. Preferably for differential display, L > 400 nt and
the noise length scale is approximately 100 nt, so P = 4 is preferable. It is this feature of a basis
set such as the presently described basis set that contributes significantly to overcoming or
eliminating the effects of low frequency noise.

Other acceptable basis sets include, for example, polynomials (φ_p(n) = n^p), special
functions, and wavelets, and are well-known in the art. See, e.g., Press et al., Numerical Recipes
in C: The Art of Scientific Computing, Second Edition, Cambridge Univ. Press, Cambridge
UK, 1992, Chapters 5, 12 and 13.

The coefficients c_p are selected to minimize the distance Dist[A(n),s(n)a(n)] or maximize
the similarity Sim[A(n),s(n)a(n)]. Methods to perform this optimization are well-known in the
art. Preferable methods are conjugate direction minimization or conjugate gradient
minimization, which use linear algebra to optimize the P basis set coefficients simultaneously.
See, e.g., Press et al., Numerical Recipes in C: The Art of Scientific Computing, Second
Edition, Cambridge Univ. Press, Cambridge UK, 1992, Chapter 10.

For a piecewise linear basis, a preferable approximation that is faster computationally
than a full minimization is to obtain c_p from an interval surrounding n_p, preferably the interval
from n_{p-1} to n_{p+1}, by minimizing the distance Dist[A(n),c_p a(n)] or maximizing the similarity
Sim[A(n),c_p a(n)].
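For the squared log-ratio distance, the local estimate of c_p has a closed form: minimizing Σ [ln(A(n)/(c a(n)))]² over c gives ln c equal to the mean of ln(A(n)/a(n)). The sketch below uses that result together with linear interpolation between the fitted points. It rests on several assumptions not stated in the text: unit weights, masking on the unscaled ratio with D_max = 3, and a constant scale factor outside the outermost interpolation points. All names are hypothetical.

```python
import math


def local_scale(A, a, lo, hi, d_max=3.0):
    """c minimizing sum of [ln(A(n) / (c a(n)))]^2 over positions lo..hi inclusive."""
    logs = []
    for n in range(lo, hi + 1):
        if A[n] > 0 and a[n] > 0:
            log_ratio = math.log(A[n] / a[n])
            if abs(log_ratio) < d_max:   # drop strongly discordant points
                logs.append(log_ratio)
    # Fall back to no scaling if nothing usable lies in the interval.
    return math.exp(sum(logs) / len(logs)) if logs else 1.0


def piecewise_linear_scale(knots, coeffs, n):
    """s(n) interpolated linearly between (knots[p], coeffs[p]) pairs."""
    if n <= knots[0]:
        return coeffs[0]
    if n >= knots[-1]:
        return coeffs[-1]
    for p in range(1, len(knots)):
        if n <= knots[p]:
            frac = (n - knots[p - 1]) / (knots[p] - knots[p - 1])
            return coeffs[p - 1] + (coeffs[p] - coeffs[p - 1]) * frac


slave = [10.0, 20.0, 30.0, 40.0]
master = [20.0, 40.0, 60.0, 80.0]
print(round(local_scale(master, slave, 0, 3), 6))     # 2.0
print(piecewise_linear_scale([0, 10], [1.0, 2.0], 5))  # 1.5
```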

Preferably for differential display, the number of piecewise linear basis functions is
selected so that the low-frequency noise in the data occurs on a length scale of L/(0.3P) or
longer. With L = 400 nt and a noise length scale of approximately 100 nt, P = 13 is preferable
(interpolation points spaced every 35 nt).

Iterative Scaling 

A group of data sets {a_i(n)} can be brought into closer agreement with each other by first
normalizing each data set, then generating an average A(n), then scaling each data set a_i(n) to the
average A(n), then repeating these steps. If desired, the average A(n) can be re-normalized after
every iteration.

Iterations continue until a termination condition has been satisfied. A preferable
termination condition is that A(n) has converged. This means that the distance Dist[A(n),A'(n)]
between the value of A(n) after an iteration and its value A'(n) after the next iteration is smaller
than some threshold value. Alternatively, the scaling functions for each of the slave data
sets a_i(n) can be checked for convergence. A second preferable termination condition is that a
predetermined number of iterations has been reached.

It is possible to allow multiple termination conditions, with iterations ending after just
one condition is satisfied.
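The average, scale, re-average loop with both termination conditions can be sketched as follows. This is a deliberately simplified illustration: each trace is scaled by a single constant factor (the P = 1 case) fitted with the squared log-ratio distance, convergence is judged with the mean square distance between successive averages, and all traces are assumed strictly positive. Function names, the tolerance, and the data are hypothetical.

```python
import math


def mean_trace(traces):
    """Unweighted pointwise average of equally long traces."""
    return [sum(values) / len(values) for values in zip(*traces)]


def global_log_scale(master, slave):
    """Constant c minimizing sum_n [ln(master(n) / (c * slave(n)))]^2."""
    logs = [math.log(m / s) for m, s in zip(master, slave)]
    return math.exp(sum(logs) / len(logs))


def square_distance(x, y):
    return sum((u - v) ** 2 for u, v in zip(x, y)) / len(x)


def iterative_scaling(traces, max_iterations=2, tolerance=1e-9):
    average = mean_trace(traces)
    scaled = [list(t) for t in traces]
    for _ in range(max_iterations):                 # termination: iteration limit
        scaled = [[global_log_scale(average, t) * v for v in t] for t in traces]
        new_average = mean_trace(scaled)
        converged = square_distance(average, new_average) < tolerance
        average = new_average
        if converged:                               # termination: A(n) has converged
            break
    return average, scaled


# Two replicates that differ only by overall loading.
average, scaled = iterative_scaling([[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]])
print([round(v, 6) for v in average])  # [15.0, 30.0, 45.0]
```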

Note that the square distance measure essentially calculates the standard deviation of the
data sets. Thus, minimizing the square distance is essentially identical with performing scaling
that minimizes the standard deviation of the scaled traces.

Iterative scaling may occur at a hierarchy of levels including experimental replicates,
independent individuals, and groups. Recall that the data set corresponding to experimental
replicate j of organism i of group A is a_ij(n). Similarly, the data sets b_ij(n) are obtained for group
B, data sets c_ij(n) for group C, and so forth for each of the groupings.

Iterative scaling may be performed first
within each organism, then within each group, and then between groups. One process is as
follows:

• For each individual i in each group A, compute the average A_i(n) by iterative scaling of
the data sets a_ij(n) as follows:

  o Each data set in a_ij(n) is normalized.

  o Initialize A_i(n) as the average of the experimental replicates.

  o Repeat the following steps until A_i(n) has converged or the number of iterations
  has reached a threshold:

    • Optimize the scaling function s_ij(n) to bring a_ij(n) into best agreement
    with A_i(n).

    • Compute the new average A_i(n) from the scaled sets s_ij(n)a_ij(n).

    • Optionally normalize the new A_i(n).

  o Calculate the standard deviation SD_i(n) as a measure of the experimental variance
  between scaled data sets.

• For each group A, compute the average A(n) by iterative scaling of the individual
averages A_i(n) as follows:

  o Initialize A(n) as the average of the individual averages A_i(n).

  o Repeat the following steps until A(n) has converged or the number of iterations
  has reached a threshold:

    • Optimize the scaling function s_i(n) to bring each A_i(n) into best agreement
    with the group average A(n).

    • Compute the new average A(n) from the scaled individual averages
    s_i(n)A_i(n).

    • Optionally, normalize the new A(n).

  o Calculate the standard deviation SD_A(n) as a measure of the variance between
  scaled individual averages.

• For the final scaling of groups to each other, perform one of the following two
operations:

  o Option 1: scale by computing the grand mean M(n) from all the group averages
  A(n), B(n), ..., as follows:

    • Initialize the grand mean M(n) as the average of A(n), B(n), ... .

    • Repeat the following steps until M(n) has converged or the number of
    iterations has reached a threshold:

      • Optimize the scaling functions s_A(n), s_B(n), ..., that bring A(n),
      B(n), ..., into best agreement with M(n).

      • Compute the new M(n) from the scaled group averages s_A(n)A(n),
      s_B(n)B(n), ....

      • Optionally, normalize the new M(n).

    • Calculate the standard deviation SD(n) as a measure of the variance
    between scaled group averages.

  o Option 2: select one of the groups R(n) as a reference and scale the remaining
  groups to R(n). Calculate the standard deviation SD(n) as a measure of the
  variance between the scaled group averages.

Using this process, the scaling terms must be back-propagated to compare averages other
than the final, scaled group averages. For example, if scaled group averages are required,
s(n)A(n) is used. If scaled individual averages are required, then s(n)s_i(n)A_i(n) is used. If scaled
data sets are required, then s(n)s_i(n)s_ij(n)a_ij(n) is used.

In an alternate implementation, intermediate averages are not required. This implementation
requires that a weighting method be selected. With a preferred weighting method, each data set
from individual i is preferably given a weight proportional to 1/(number of replicates from
individual i). This gives each individual equal weight and prevents an individual with many
replicates from dominating the average. Other preferable methods include weighting each data
set equally and weighting each data set to give each group equal weight. If each group has
equal weight, one method is to weight each replicate equally. Thus, each data set from group A
is given a weight proportional to 1/(number of replicates from all the individuals belonging to
group A). An alternate method is to weight each data set variably to give each individual within
a group equal weight. Thus, each data set from individual i of group A is given a weight
proportional to 1/[(number of individuals in group A)(number of replicates in individual i)].
After selecting a weighting method, apply the following algorithm:

• Initialize the grand mean M(n) by averaging all the data sets a_ij(n) according to the
selected weighting method.

• Repeat the following steps until M(n) has converged or the number of iterations reaches a
threshold:

  o Optimize the scaling functions s_ij(n) to bring each a_ij(n) into best agreement with
  M(n).

  o Compute the new M(n) from the scaled data sets s_ij(n)a_ij(n) using the selected
  weighting method.

  o Optionally, normalize the new M(n).

• Calculate the individual averages A_i(n) by averaging the scaled replicates s_ij(n)a_ij(n)
belonging to individual i with weights according to the selected weighting method.
Calculate the standard deviation SD_i(n) within each individual.

• Calculate the group averages A(n), B(n), ..., by averaging the individual averages A_i(n)
according to the selected weighting method. Calculate the standard deviation SD_A(n),
SD_B(n), ..., within each group.

For each of the iterative scaling steps, a preferable threshold for differential display data
is 2 iterations.

Jiggling 

One aspect of difference finding is comparing the heights of peaks in two data sets. In 
many data sets, the same peak may occur at different positions in different data sets. For 
example, a peak in one replicate of a data set may occur at position n, while in a second data set it
may occur at position n+1 or n-1 due to experimental variability. (See FIG. 1)

A jiggling algorithm identifies the peak height in a data set a(n) that corresponds to a
given location n'. A preferred jiggling algorithm requires a parameter w_jiggle, which describes
the width of the jiggling window.

The preferred algorithm starts at position n' and searches for the peak in a(n) closest to n'
and within distance w_jiggle. The height of a(n) at this peak position is termed the jiggled height
of a(n) at n'. If two peaks are within equal distance, the higher value is preferably taken as the
jiggled height. If there is no peak within distance w_jiggle, then the height a(n') is the jiggled
height.

For a one-dimensional data set, for example, the data range for the jiggling peak search is
n'-w_jiggle through n'+w_jiggle. If n' is a peak in a(n), then the value a(n') is the jiggled height of
a(n) at n'. Otherwise the positions n'±1, n'±2, ..., n'±w_jiggle are tested in turn for peaks; if a
peak is found at location n'', then a(n'') is the jiggled height of a(n) at n'. If no peak is found,
then a(n') is the jiggled height.
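The preferred one-dimensional search can be sketched as below. It assumes a peak mask has already been computed for the data set and passed in as a list of 0/1 values; the function name and sample values are hypothetical.

```python
def jiggled_height(a, peak_mask, n_prime, w_jiggle):
    """Height of the peak nearest to n_prime within w_jiggle, else a(n_prime)."""
    if peak_mask[n_prime]:
        return a[n_prime]
    for offset in range(1, w_jiggle + 1):
        # Test n' - offset and n' + offset together; on a tie take the higher peak.
        candidates = [
            a[k]
            for k in (n_prime - offset, n_prime + offset)
            if 0 <= k < len(a) and peak_mask[k]
        ]
        if candidates:
            return max(candidates)
    return a[n_prime]


trace = [1, 5, 2, 1, 3, 7, 2]
peaks = [0, 1, 0, 0, 0, 1, 0]
print(jiggled_height(trace, peaks, n_prime=3, w_jiggle=2))  # 7
print(jiggled_height(trace, peaks, n_prime=3, w_jiggle=1))  # 1
```

In the first call both peaks lie at distance 2, so the higher value is taken; in the second no peak is in range and the height at n' itself is returned.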

Less preferably, all of the peaks within distance w_jiggle of n' are examined and the
maximum value is taken as the jiggled height of a(n) at n'. For a one-dimensional data set, all of
the peaks in the window n'-w_jiggle, ..., n'+w_jiggle are examined and the
maximum value is taken as the jiggled height of a(n) at n'; if there is no peak in the interval,
then a(n') is the jiggled height.

For differential display data, a preferable value is w_jiggle Δx = 0.4 nt.

Difference Finding

Difference finding identifies locations where at least one of the groups has a peak and its
value is significantly different from other groups. The group averages and individual
averages produced by iterative scaling serve as inputs to difference finding.

It is preferable to employ an algorithm that uses jiggling to identify corresponding peaks
in different data sets. This avoids spurious differences due to slight offsets in peak positions.

It is also preferable to employ an algorithm that identifies at most one difference from
peaks that correspond. A preferable method employs a parameter w_diff that defines the
minimum distance between differences. A preferable choice is w_diff > w_peak. For differential
display data, a preferable value is w_diff Δx = 1.1 nt.

A preferred algorithm is as follows:

• Generate a master peak mask M_peak(n) using one of the following alternatives:

  o Option 1: For each group A and individual A_i, calculate a peak mask m_peak(n)
  from the individual average A_i(n). Then, for each position n, M_peak(n) is 1 if at
  least one of the individual peak masks is 1 and is 0 otherwise.

  o Option 2: Calculate the peak mask M_peak(n) directly from the grand mean M(n)
  of all the groups.

  o Option 3: If there are only two groups A and B, generate a peak mask from the
  difference A(n) - B(n).

• For each position n that appears in the peak mask M_peak(n), calculate the significance of
a difference as follows:

  o For each individual i in each group A, find the jiggled height of A_i at n.
  o Calculate the group averages based on the jiggled heights.

  o Perform an F-test that compares the variance between group averages with the
  variance within groups.

  o Associate the p-value of the F-test with the position n. Also record the number of
  samples that had a peak at position n.

• Generate a list of peak positions sorted from lowest p-value to highest p-value. If two
positions have the same p-value, break the tie by listing first the position with more
sample peaks. Break any remaining ties by listing first the lower position.
• Repeat the following steps until the list is empty:

  o Remove the first element from the list and record its position n as a difference.
  o Strike out any remaining elements in the list that are within distance w_diff of n.
  For a one-dimensional data set, for example, strike out any differences at
  positions n±1, n±2, ..., n±w_diff.
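The final sort-and-strike step can be sketched as follows. The sketch assumes the p-values and sample peak counts have already been computed for every candidate position; the tuple layout, function name, and numbers are hypothetical.

```python
def select_differences(candidates, w_diff):
    """candidates: list of (position, p_value, sample_peak_count) tuples."""
    # Lowest p-value first; ties go to more sample peaks, then to the lower position.
    ordered = sorted(candidates, key=lambda c: (c[1], -c[2], c[0]))
    differences = []
    while ordered:
        position, p_value, _ = ordered.pop(0)
        differences.append((position, p_value))
        # Strike out remaining candidates within distance w_diff of this position.
        ordered = [c for c in ordered if abs(c[0] - position) > w_diff]
    return differences


candidates = [(100, 0.01, 3), (105, 0.02, 4), (130, 0.01, 5), (200, 0.5, 1)]
print(select_differences(candidates, w_diff=11))
# [(130, 0.01), (100, 0.01), (200, 0.5)]
```

Position 105 is dropped because it lies within w_diff of the more significant difference at 100.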

For difference fmding with two groups A and B, a pooled variance t-test may be 

employed instead of an F-test. 
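For reference, the pooled variance t statistic for two groups of jiggled heights can be sketched as below. Only the statistic and its degrees of freedom are computed; converting them to a p-value requires the Student t distribution, which is left to a statistics library. The function name and sample heights are hypothetical.

```python
import math


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    dof = na + nb - 2
    pooled_variance = (ss_a + ss_b) / dof
    t = (mean_a - mean_b) / math.sqrt(pooled_variance * (1.0 / na + 1.0 / nb))
    return t, dof


t, dof = pooled_t_statistic([10.0, 12.0, 14.0], [4.0, 6.0, 8.0])
print(round(t, 4), dof)  # 3.6742 4
```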

A preferred algorithm for a comparison between two groups, A and B, and a one-
dimensional data set, is as follows:

• Perform the final step of iterative scaling by scaling A(n) to B(n).

• Calculate the peak mask M_peak(n) from the difference A(n) - B(n).

• Generate a list of peak positions sorted from smallest n to largest n.

• Initialize a variable LASTPOSITION as 0 and a variable LASTDIRECTION as 0.

• Repeat the following steps until the list of peaks is empty:

  o Remove the first element n from the list and calculate p-value(n) from a t-test
  comparing the jiggled heights of individuals from group A to the jiggled heights
  of individuals from group B at position n. If the average of the group A heights is
  larger than the average of the group B heights, then direction(n) = 1; otherwise
  direction(n) = -1.

  o If direction(n) is not equal to LASTDIRECTION, or if (n - LASTPOSITION) >
  w_diff, then

    • If LASTPOSITION is not 0, save LASTPOSITION as a difference with p-
    value equal to LASTPVALUE.

    • Update LASTPOSITION = n, LASTDIRECTION = direction(n), and
    LASTPVALUE = p-value(n).

  o Otherwise, if p-value(n) < LASTPVALUE, then

    • Update LASTPOSITION = n, LASTDIRECTION = direction(n), and
    LASTPVALUE = p-value(n).

  o Otherwise

    • Continue with the next peak from the list.

• If LASTPOSITION is not 0, save LASTPOSITION as a difference
with p-value equal to LASTPVALUE.
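The left-to-right sweep for two groups can be sketched as follows. The sketch assumes that p-values and directions have already been computed for each peak position, uses None instead of 0 as the "nothing saved yet" marker, and saves the last pending difference once the list is exhausted. The tuple layout, function name, and numbers are hypothetical.

```python
def sweep_differences(peaks, w_diff):
    """peaks: list of (position, p_value, direction) sorted by position."""
    differences = []
    last = None  # (position, p_value, direction) of the pending difference
    for position, p_value, direction in peaks:
        starts_new = (
            last is None
            or direction != last[2]
            or position - last[0] > w_diff
        )
        if starts_new:
            if last is not None:
                differences.append((last[0], last[1]))
            last = (position, p_value, direction)
        elif p_value < last[1]:
            # Same run of peaks: keep only the most significant position.
            last = (position, p_value, direction)
    if last is not None:
        differences.append((last[0], last[1]))
    return differences


peaks = [(100, 0.05, 1), (105, 0.01, 1), (110, 0.2, -1), (300, 0.03, 1)]
print(sweep_differences(peaks, w_diff=11))
# [(105, 0.01), (110, 0.2), (300, 0.03)]
```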
EXAMPLE 

Example 1. Differential Gene Expression in Phenobarbital-Treated Rats

-respond. .0 Em 00 (U,= ^ j J J " ™^ ^"^^^ 

~ group, ^ an addWon. ^^^1 7 

group. ™* 'o «rv= as conttol 

Rats were sacrificed 24 hours aftpr ti,o r i j 
roll f . n ''^^^'"^''^^^^^d their brains were harvested 

rrd::::r~^^^^^^^^^^ 

point every 0.1 nt. The averages of the 3 renr . discretized for a 

nnr^ r • ^ '""^^^ """^ ^'o'' ^^^h individual Drior to 

normahzation or scaling, are shown in FIG 4 Each tr... ■ . 

were masked for either noise or saturation.

Next, noise was masked using I_noise = 500 and w_sat = 5, and peaks were identified in each
replicate using w_peak = 3 with the condition that all 7 points contributing to a peak be above the
noise level and not saturated. The peaks were sorted in increasing order of height, and each
replicate was normalized to give the 75th percentile peak an intensity of 100. The 3 normalized
traces were averaged for each individual, and the individual averages are displayed in FIG. 5.
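The percentile normalization used here can be sketched as follows. The sketch assumes the nearest-rank convention for picking the 75th percentile peak, which the text does not specify, and takes the peak mask as a precomputed list of 0/1 values. The function name and sample trace are hypothetical.

```python
import math


def normalize_to_percentile_peak(a, peak_mask, percentile=75, target=100.0):
    """Rescale a trace so the peak at the given percentile of height equals target."""
    heights = sorted(value for value, is_peak in zip(a, peak_mask) if is_peak)
    if not heights:
        raise ValueError("no peaks available for normalization")
    # Nearest-rank percentile (an assumed convention).
    rank = max(1, math.ceil(percentile / 100.0 * len(heights)))
    reference = heights[rank - 1]
    return [value * target / reference for value in a]


trace = [0.0, 10.0, 0.0, 20.0, 0.0, 30.0, 0.0, 40.0, 0.0]
peaks = [0, 1, 0, 1, 0, 1, 0, 1, 0]
normalized = normalize_to_percentile_peak(trace, peaks)
print(normalized[5])  # 100.0
```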

For all the scaling operations that follow, the basis set used was piecewise linear with 13
scaling points located every 35 nt beginning at 30 nt and ending at 450 nt. The distance function
was [ln(a(n)/A(n))]^2, and points for which |ln(a(n)/A(n))| > 3 were masked out.

The first step in the scaling procedure was that the 3 normalized traces for each individual
were averaged, scaled to the average, re-averaged, then re-scaled to the average for 2 rounds of
iterative scaling. Next, the phenobarbital-treated individual average and the sterile-water-treated
individual average were themselves averaged, the individual averages scaled to the grand
average, then the process repeated for 2 rounds of iterative scaling. Finally, the phenobarbital-
treated individual average was scaled to the sterile-water-treated individual average. The final
scaling factors are shown in FIG. 6 for the two individuals. The final individual averages are
shown in FIG. 7.

With both the normalized traces and the normalized and scaled traces, difference finding
was performed using w_peak = 2, w_jiggle = 4, and w_diff = 11. Significance levels were calculated as
1 - p-value from a t-test based on the scaled replicate traces. See Table 1, below. Only
differences with a significance greater than 0.9 and a ratio |ln(Phenobarbital/Water)| > ln(1.5)
were retained. The differences identified are listed in the table below (significances greater than
0.99 are reported as 1). The normalized traces generated 35 differences, whereas the scaled
traces generated 18 differences, of which 12 were in common with the normalized traces. The
differences in the normalized traces that are removed by scaling tend to have lower significance
than the differences that are retained, indicating that scaling helps identify the differences with
greater support in the data.

Table 1

Position (nt):
51.9, 53.6, 67.3, 8].g, 87.1, 103.2, 120.8, 127.5, 154.6, 158.9, 165.0, 170.0, 190.1, 205.7,
218.6, 228.1, 233.0, 263.1, 274.1, 280.0, 303.2, 331.8, 340.6, 352.2, 353.7, 354.9, 388.8,
394.5, 395.7, 402.4, 404.7, 406.4, 437.7, 443.1, 447.9

Significance (1 - p-value), columns Normalized Only and Normalized and Scaled:
1, 1, 0.94, 0.99, 0.97, 1, 0.98, 0.94, 1, 0.91, 1, 0.98, 0.99, 0.96, 1, 1, 1, 0.96, 1, 1, 0.99,
0.98, 1, 1, 0.97, 1, 1, 0.95, 1, 1, 1, 0.98, 1, 0.99, 1, 1, 0.91, 1, 1, 0.99, 1, 0.9, 0.99


Equivalents

It is contemplated by the inventor that various substitutions, alterations, and modifications may be
made to the invention without departing from the spirit and scope of the invention as defined by
the claims. For instance, the choice of algorithms used to transform data sets, such as
normalization calculations, averaging calculations or scaling calculations, or the choice of data
sets to be analyzed is believed to be a matter of routine for a person of ordinary skill in the art
with knowledge of the embodiments described herein.



1. A method of identifying a difference between at least two groups, the method
comprising:

a) providing a first group having one or more elements in a first data set;

b) applying at least one transformation to said first data set to provide a transformed data
set, wherein said transformation is a calculation selected from a normalizing calculation, an
averaging calculation and a scaling calculation; and

c) distinguishing differences, if present, between elements of said first transformed data
set and a second group having one or more elements in a second data set;

thereby identifying a difference between the groups.

2. The method of claim 1, wherein said second data set is a transformed data set.

3. The method of claim 1, wherein said transformation minimizes noise associated
with one or more signals associated with elements of at least one data set.

4. The method of claim 3, wherein said noise comprises low frequency noise.

5. The method of claim 3, wherein said noise comprises jiggle.

6. The method of claim 5, wherein said jiggle includes positional shifts of
corresponding elements between said first and second data sets.



7.

8. The method of claim 7, wherein said trace represents the result of an
electrophoretogram or a chromatogram.



9. The method of claim 1, wherein elements of at least one data set comprise one or
more positions in an array.

10. The method of claim 9, wherein any one position in the array is used to determine
an extent of matching between a reagent affixed to the array at said position and a sample
contacting the array position.

11. The method of claim 10, wherein the reagent is a first nucleic acid and the sample
comprises a second nucleic acid, wherein the second nucleic acid is in a hybridization solution
mixture, and the matching constitutes hybridization of a sample nucleic acid to the affixed
nucleic acid.

12. The method of claim 1, wherein at least one data set comprises elements derived
from an analysis of one or more differentially expressed nucleic acids.

oneo .^'T''^'''''^'"''^^^^^^^^^ 
one or more mdividuals in a group. 



14. The method of claim 13, wherein at least one data set is subjected to a masking
operation.



15. The method of claim 13, wherein the elements in the data sets comprise values
from an analysis of differentially expressed nucleic acids in the plurality of individuals.

16. The method of claim 15, wherein a data set obtained for each group is obtained by
applying at least one transformation to the data set from each individual, wherein said
transformation is selected from the group consisting of a normalizing calculation, an averaging
calculation and a scaling calculation.



18. The method of claim 15, wherein the data set from any one individual or replicate
comprises a data set derived from a trace of an electropherogram or a chromatogram.

19. The method of claim 18, wherein said data set is transformed by applying at least
one of a normalizing calculation, an averaging calculation and a scaling calculation to the data
set from each individual or replicate.

20. The method of claim 18, wherein said data set is discretized prior to the
normalizing, the scaling and/or the averaging calculation.



21. The method of claim 1, wherein said transformation comprises a normalization
calculation.

22. The method of claim 21, wherein said normalization calculation comprises
scaling each data set such that a subset of elements in each data set has similar or identical

23. The method of claim 21, wherein at least one data set is subjected to a masking
operation.



24. The method of claim 1, wherein said transformation comprises an averaging
calculation.

25. The method of claim 24, wherein said averaging calculation comprises calculating
an average for a location or for a discretized position of said first and second data sets, and
wherein the average may be an unweighted average or a weighted average.

26. The method of claim 1, wherein said transformation comprises a scaling
calculation.

27. The method of claim 26, wherein said scaling calculation comprises a calculation
that causes a first data set to resemble a second data set, provided an element in the scaled
data set whose intensity differs significantly from the intensity of the element in the second
data set at the same location or the same position contributes to identifying the difference
between the data sets.



28. ^'-*odofc,^m27.whe.i„scalingcompris=sca,cula,ingasca,i„gftnction 
based on opumimion of a distance between the data sets. 

29. •n«">«hodofclaim27,wh=r=inscali„gcomprisescalcula,ingascalingftnction 
based on opimnzation of a similarity between tiie data sets. 



30. The method of claim 27, wherein the scaling calculation employs a scaling
function.

31. The method of claim 30, wherein the scaling function is a basis set expansion.

32. The method of claim 31, wherein the basis set is a piecewise linear basis set.

33. The method of claim 31, wherein the basis set is a Fourier series.

34. The method of claim 30, wherein the scaling function is a direct product of basis
functions.
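
One way to picture a scaling function expressed as a piecewise linear basis set expansion is sketched below, for a one-dimensional trace only. It is a sketch under stated assumptions, not the claimed method: the distance being optimized is taken to be the sum of squared differences, the knots are evenly spaced, and the names `hat_basis`, `solve` and `fit_scaling` are hypothetical.

```python
def hat_basis(n_points, n_knots):
    """Piecewise linear 'hat' functions on evenly spaced knots (one row per position)."""
    step = (n_points - 1) / (n_knots - 1)
    return [
        [max(0.0, 1.0 - abs(i / step - k)) for k in range(n_knots)]
        for i in range(n_points)
    ]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_scaling(first, second, n_knots=3):
    """Find s(i) = sum_k c_k * hat_k(i) minimizing sum_i (s(i) * first[i] - second[i])**2."""
    basis = hat_basis(len(first), n_knots)
    design = [[b * a for b in row] for row, a in zip(basis, first)]
    normal = [
        [sum(d[j] * d[k] for d in design) for k in range(n_knots)]
        for j in range(n_knots)
    ]
    rhs = [sum(d[j] * y for d, y in zip(design, second)) for j in range(n_knots)]
    coeffs = solve(normal, rhs)
    scale = [sum(c * b for c, b in zip(coeffs, row)) for row in basis]
    return [s * a for s, a in zip(scale, first)], scale


# Hypothetical traces: the second is the first multiplied by a slowly varying factor.
scaled, scale = fit_scaling([1.0] * 5, [2.0, 2.5, 3.0, 3.5, 4.0])
```

Because only a few smooth basis functions are fitted, a sharp local difference between the two traces cannot be absorbed by the scaling function and remains visible after scaling, which is the property the scaling claims rely on.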



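35. The method of claim 1, wherein the transformation comprises successive
iterations of a cycle of calculations that are carried out until a specified termination condition
has been satisfied, wherein the calculations comprise at least one of a normalization
calculation, an averaging calculation and a scaling calculation.

36. The method of claim 35, wherein the termination condition is met when a
transformed data set has converged.

37. The method of claim 35, wherein the termination condition is met when a
predetermined number of iterations of a cycle of calculations has been reached.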
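
The two termination conditions recited above (convergence of the transformed data set, or a predetermined number of iterations) can be illustrated with the following sketch. The convergence test (largest element-wise change below `tol`), the default values, and the placeholder `toy_cycle` are assumptions; a real cycle would chain the normalizing, averaging and scaling calculations.

```python
def iterate_until_converged(data, cycle, tol=1e-6, max_iter=50):
    """Apply cycle repeatedly; stop on convergence or after max_iter cycles.

    Returns (transformed data, cycles run, whether convergence was reached).
    """
    for n in range(1, max_iter + 1):
        updated = cycle(data)
        change = max(abs(u - d) for u, d in zip(updated, data))
        data = updated
        if change < tol:
            return data, n, True
    return data, max_iter, False


def toy_cycle(values):
    """Placeholder cycle: pull each element halfway toward the mean."""
    mean = sum(values) / len(values)
    return [0.5 * (v + mean) for v in values]


result, cycles_run, converged = iterate_until_converged([0.0, 4.0], toy_cycle)
```

Lowering `max_iter` turns the iteration cap into the effective termination condition; the returned flag tells the caller which of the two conditions ended the loop.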

38. The method of claim 1, wherein elements within any two or more data sets have
been corrected for signal alignment.



39. The method of claim 1, wherein the distinguishing of differences between the
elements of the resulting data sets comprises an application of at least one difference finding
algorithm.

40. A display displaying a representation of a difference between two or more
transformed data sets, wherein each data set comprises ordered elements; and the data sets are
transformed by at least one calculation selected from the group consisting of a normalizing
calculation, an averaging calculation and a scaling calculation.



41. The display means of claim 40, wherein the representation is obtained by a
process comprising the steps of:

a) providing a first group having one or more elements in a first data set;

b) applying at least one transformation to said first data set to provide a transformed data
set, wherein said transformation is a calculation selected from a normalizing calculation, an
averaging calculation and a scaling calculation; and

c) distinguishing differences, if present, between elements of said first transformed data
set and a second group having one or more elements in a second data set;

thereby identifying a difference between the groups.



42. The display means of claim 41, wherein said second data set is a transformed data
set.

43. The display means of claim 41, wherein said transformation minimizes noise
associated with one or more signals associated with elements of at least one data set.

44. The display means of claim 43, wherein said noise comprises low frequency
noise.



45. The display means of claim 43, wherein said noise comprises jiggle.

46. The display means of claim 45, wherein said jiggle includes positional shifts of
corresponding elements between said first and second data sets.

47. The display means of claim 41, wherein elements of at least one data set comprise
a trace.

48. The display means of claim 47, wherein said trace represents the results of an
electrophoretogram or a chromatogram.



49. The display means of claim 41, wherein elements of at least one data set comprise
one or more positions in an array.

50. The display means of claim 49, wherein any one position in the array is used to
determine the extent of matching between an agent affixed to the array at said position and a
sample contacting the array position.



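51. The display means of claim 50, wherein the agent is a first nucleic acid and the
sample comprises a second nucleic acid, wherein the second nucleic acid is in a hybridization
solution mixture, and the matching comprises hybridization of a sample nucleic acid to the
affixed nucleic acid.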



52. The display means of claim 41, wherein at least one data set comprises elements
derived from an analysis of one or more differentially expressed nucleic acids.

53. The display means of claim 41, wherein at least one data set is derived from an
analysis of one or more individuals in a group.

54. The display means of claim 53, wherein at least one data set is subjected to a
masking operation.

55. The display means of claim 53, wherein the elements in the data sets comprise
values derived from an analysis of differentially expressed nucleic acids in the plurality of
individuals in a group.

56. The display means of claim 55, wherein a data set obtained for each group is
obtained by applying at least one transformation to the data set from each individual, wherein
said transformation is selected from the group consisting of a normalizing calculation, an
averaging calculation and a scaling calculation.



57. The display means of claim 55, wherein each individual provides at least one
replicate sample.

58. The display means of claim 55, wherein the data set from any one individual or
replicate comprises a data set derived from a trace of an electropherogram or a chromatogram.

59. The display means of claim 58, wherein said data set is transformed by applying
at least one of a normalizing calculation, an averaging calculation and a scaling calculation to the
data set from each individual or replicate.

60. The display means of claim 58, wherein said data set is discretized prior to the
normalizing, the scaling and/or the averaging calculation.

61. The display means of claim 41, wherein said transformation comprises a
normalization calculation.
63. The display means of claim 61, wherein at least one data set is subjected to a
masking operation.

64. The display means of claim 41, wherein said transformation comprises an
averaging calculation.

65. The display means of claim 64, wherein said averaging calculation comprises
calculating an average for a location or for a discretized position of said first and second data
sets, and wherein the average may be an unweighted average or a weighted average.

66. The display means of claim 41, wherein said transformation comprises a scaling
calculation.



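67. The display means of claim 66, wherein said scaling calculation comprises a
calculation that causes a first data set to resemble a second data set, provided an element in
the scaled first data set whose intensity differs significantly from the intensity of the element in
the second data set at the same location or the same position contributes to identifying the
difference between the data sets.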
69. The display means of claim 67, wherein scaling comprises calculating a scaling
function based on optimization of a similarity between the data sets.

70. The display means of claim 67, wherein the scaling calculation employs a scaling
function.



71. The display means of claim 70, wherein the scaling function is a basis set
expansion.

72. The display means of claim 71, wherein the basis set is a piecewise linear basis
set.

73. The display means of claim 71, wherein the basis set is a Fourier series.

74. The display means of claim 70, wherein the scaling function is a direct product of
basis functions.



75. The display means of claim 41, wherein the transformation comprises successive
iterations of a cycle of calculations that are carried out until a specified termination condition has
been satisfied, wherein the calculations comprise at least one of a normalization calculation, an
averaging calculation and a scaling calculation.

76. The display means of claim 75, wherein the termination condition is met when a
transformed data set has converged.

77. The display means of claim 75, wherein the termination condition is met when a
predetermined number of iterations of a cycle of calculations has been reached.

78. The display means of claim 41, wherein elements within any two or more data
sets have been corrected for signal alignment.



sets, wherein each data set comprises ordered elements. 



81. The representation of claim 80, wherein the representation is obtained by a process
comprising the steps of:

a) providing a first group having one or more elements in a first data set;

b) applying at least one transformation to said first data set to provide a transformed data
set, wherein said transformation is a calculation selected from a normalizing calculation, an
averaging calculation and a scaling calculation; and

c) distinguishing differences, if present, between elements of said first transformed data
set and a second group having one or more elements in a second data set;

thereby identifying a difference between the groups.



82. The representation of claim 81, wherein said second data set is a transformed data
set.

83. The representation of claim 81, wherein said transformation minimizes noise
associated with one or more signals associated with elements of at least one data set.

84. The representation of claim 83, wherein said noise comprises low frequency
noise.



85. The representation of claim 83, wherein said noise comprises jiggle.

87. The representation of claim 81, wherein elements of at least one data set comprise
a trace.

89. The representation of claim 81, wherein elements of at least one data set comprise
one or more positions in an array.



90. The representation of claim 89, wherein any one position in the array is used to
determine the extent of matching between an agent affixed to the array at said position and a
sample contacting the array position.

91. The representation of claim 90, wherein the agent is a first nucleic acid and the
sample comprises a second nucleic acid, wherein the second nucleic acid is in a hybridization
solution mixture, and the matching comprises hybridization of a sample nucleic acid to the
affixed nucleic acid.
92. The representation of claim 81, wherein at least one data set comprises elements
derived from an analysis of one or more differentially expressed nucleic acids.

93. The representation of claim 81, wherein at least one data set is derived from an
analysis of one or more individuals in a group.

94. The representation of claim 93, wherein at least one data set is subjected to a
masking operation.



95. The representation of claim 93, wherein the elements in the data sets comprise
values derived from an analysis of differentially expressed nucleic acids in the plurality of
individuals in a group.

96. The representation of claim 95, wherein a data set obtained for each group is
obtained by applying at least one transformation to the data set from each individual, wherein
said transformation is selected from the group consisting of a normalizing calculation, an
averaging calculation and a scaling calculation.



97. The representation of claim 95, wherein each individual provides at least one
replicate sample.

98. The representation of claim 95, wherein the data set from any one individual or
replicate comprises a data set derived from a trace of an electropherogram or a chromatogram.

99. The representation of claim 98, wherein said data set is transformed by applying
at least one of a normalizing calculation, an averaging calculation and a scaling calculation to the
data set from each individual or replicate.

100. The representation of claim 98, wherein said data set is discretized prior to the
normalizing, the scaling and/or the averaging calculation.



101. The representation of claim 81, wherein said transformation comprises a
normalization calculation.

102. The representation of claim 101, wherein said normalization calculation
comprises adjusting each data set such that a subset of elements in each data set has similar or
identical values.

103. The representation of claim 101, wherein at least one data set is subjected to a
masking operation.

104. The representation of claim 81, wherein said transformation comprises an
averaging calculation.

105. The representation of claim 104, wherein said averaging calculation comprises
calculating an average for a location or for a discretized position of said first and second data
sets, and wherein the average may be an unweighted average or a weighted average.

106. The representation of claim 81, wherein said transformation comprises a scaling
calculation.



107. The representation of claim 106, wherein said scaling calculation comprises a
calculation that causes a first data set to resemble a second data set, provided an element in
the scaled first data set whose intensity differs significantly from the intensity of the element
in the second data set at the same location or the same position contributes to identifying the
difference between the data sets.

108. The representation of claim 107, wherein scaling comprises calculating a scaling
function based on optimization of a distance between the data sets.

109. The representation of claim 107, wherein scaling comprises calculating a scaling
function based on optimization of a similarity between the data sets.



111. The representation of claim 110, wherein the scaling function is a basis set
expansion.

112. The representation of claim 111, wherein the basis set is a piecewise linear basis
set.

113. The display means of claim 111, wherein the basis set is a Fourier series.

114. The representation of claim 110, wherein the scaling function is a direct product
of basis functions.



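115. The representation of claim 81, wherein the transformation comprises successive
iterations of a cycle of calculations that are carried out until a specified termination condition has
been satisfied, wherein the calculations comprise at least one of a normalization calculation, an
averaging calculation and a scaling calculation.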
116. The representation of claim 115, wherein the termination condition is met when a
transformed data set has converged.

117. The representation of claim 115, wherein the termination condition is met when a
predetermined number of iterations of a cycle of calculations has been reached.

118. The representation of claim 81, wherein elements within any two or more data
sets have been corrected for signal alignment.



119. The representation of claim 81, wherein the distinguishing of differences between
the elements of the resulting data sets comprises an application of at least one difference finding
algorithm.